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Abstract 

The patterns of polymorphisms in genomes are imprints of the evolutionary forces at play in nature. In particular, 
polymorphisms have been extensively used to infer the fitness effects of mutations and their dynamics of fixation. 
However, the role and contribution of molecular biophysics to these observations remain unclear. Here, we couple 
robust findings from protein biophysics, enzymatic flux theory, the selection against the cytotoxic effects of protein 
misfolding, and explicit population dynamics simulations in the polyclonal regime. First, we recapitulate results on the 
dynamics of clonal interference and on the shape of the DFE, thus providing them with a molecular and mechanistic 
foundation. Second, we predict that if evolution is indeed under the dynamic equilibrium of mutation -selection balance, 
the fraction of stabilizing and destabilizing mutations is almost equal among single-nucleotide polymorphisms segregate 
ing at high allele frequencies. This prediction is proven true for polymorphisms in the human coding region. Overall, our 
results show how selection for protein folding stability predominantly shapes the patterns of polymorphisms in coding 
regions. 
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Introduction 

How and why the observed patterns of DNA polymorphisms 
arise in the genome, and what are their molecular and phe- 
notypic effects, is central to our understanding of the evolu- 
tionary forces at play in nature. In public health and medicine, 
polymorphisms are crucial in inferring the origin of diseases 
and genetic traits (McCarthy et al. 2008) and in understand- 
ing the spread of pathogens such as viruses (Vignuzzi et al. 
2006). 

A major utility of polymorphisms is in estimating the dis- 
tribution of fitness effects (DFE) of mutations. Although the 
DFE has been measured for viruses (Sanjuan et al. 2004), its 
measurement in living organisms is difficult and resolution- 
limited (Eyre-Walker and Keightley 2007). Thus, studies on 
the DFE have largely relied on Bayesian maximum-likelihood 
approaches to fit population dynamic and demographic 
models to patterns of polymorphisms and amino acid differ- 
ences between species (Bustamante et al. 2005; Eyre-Walker 
et al. 2006; Sawyer et al. 2007; Kryukov et al. 2009). A consen- 
sus result is that the DFE is characteristically skewed and can 
be described by a gamma distribution (Bustamante et al. 
2005; Eyre- Walker et al. 2006; Kryukov et al. 2009). There is, 
however, a debate on the magnitude and relative balance 
between the types of substitutions that fix in the popula- 
tion — some consider them to be neutral or near neutral 
(Ohta 1973; Nei et al. 2010), whereas others consider them 
to be predominantly adaptive and beneficial (Smith and Eyre- 
Walker 2002; Eyre-Walker and Keightley 2007). Indeed, there 
is a longstanding debate (with patterns of polymorphisms 
used as empirical support by all sides) on whether evolution 



is primarily neutral (Kimura 1968; Ohta 1973; Nei et al. 2010), 
adaptive (McDonald and Kreitman 1991; Smith and Eyre- 
Walker 2002), or driven by drift (Lynch and Conery 2003). 
Distinguishing adaptive, neutral, and nearly neutral modes of 
molecular evolution remains challenging (Akashi et al. 2012) 
because the predictions are overlapping. 

The patterns of polymorphisms can also be used to gain 
insight into the dynamics of allele segregation and in deter- 
mining which mutations are eventually fixed or lost in evo- 
lution (for a practical example, see Strelkowa and Lassig 2012). 
In general, the dynamics is expectedly complicated because of 
the intrinsic stochasticity of drift and mutation, compounded 
by history and demography of the evolving population. The 
trajectories of mutations in polyclonal populations are dy- 
namically rich because of potential clonal interference, hitch- 
hiking, and/or background selection. Major advances have 
been described in recent years to infer the dynamics 
(Gerrish and Lenski 1998; Wilke 2004; Desai and Fisher 
2007), but their connection to molecular biophysics is still 
unclear. 

Most of the approaches in the studies above assume the 
DFE and then infer the possible dynamics (Gerrish and Lenski 
1998; Desai and Fisher 2007) or assume the possible dynamics 
and then infer the DFE (McDonald and Kreitman 1991; 
Bustamante et al. 2002; Smith and Eyre-Walker 2002). This 
poses a potential limitation because demography and the 
DFE are intrinsically coupled (Silander et al. 2007). More im- 
portantly, these approaches lack explicit connection to the 
molecular properties of segregating polymorphisms, such 
as folding stability, or to the widely accepted selective 
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constraints on protein evolution, such as avoidance of protein 
misfolding and misinteraction (Pal et al. 2006; Drummond 
and Wilke 2008; Zhang et al. 2008; Koonin and Wolf 2010). 

An alternative and complementary approach is to 
develop an evolutionary framework based on a realistic 
genotype-phenotype relationship and allow the patterns of 
polymorphisms, mutational dynamics, and the DFE to be 
consequences of the model. Knowledge of the genotype- 
phenotype relationship entails systematic accounting of the 
molecular properties encoded by the genome and the exten- 
sive mapping of their interactions whether physical, biochem- 
ical, or genetic. Although these relationships are overall 
complex, at least for coding regions, there is a general con- 
sensus that the fitness of the organism is a function of the 
metabolic output (Edwards et al. 2001; Duarte et al. 2007), 
itself also a function of the biophysical properties of proteins 
(Bar-Even et al. 2010). Another emerging constraint on 
protein evolution is the global selection against the 
cytotoxic effect of aggregated, presumably misfolded proteins 
(Bucciantini et al. 2002). The universality of such a constraint 
is manifested in the consistent correlation between the rate of 
protein evolution and cellular abundance (Drummond and 
Wilke 2008; Yang et al. 2010; Serohijos et al. 2012). 

To arrive at a more mechanistic origin of the patterns of 
polymorphisms that explicitly account for their biophysical 
effects, we coupled molecular biophysics, the emerging 
knowledge of the genotype-phenotype relationship, and ex- 
plicit population dynamics simulations. First, we show that 
the DFE is not constant but a dynamic consequence of the 
evolutionary process. Specifically, under the equilibrium of 
mutation-selection balance and because of the epistatic in- 
teractions between mutational effects on protein folding sta- 
bility, the DFE evolves to be concentrated around effective 
near neutrality with the characteristic gamma distribution 
(Kryukov et al. 2009). Second, even under equilibrium, we 
observe pervasive background selection and hitchhiking 
that expand the regime of effective near neutrality, consistent 
with prior studies (e.g., McVean and Charlesworth 2000; 
Neher and Shraiman 2011). Because we base our premise 
on molecular biophysics and emerging genotype-phenotype 
relationships, our approach could provide a molecular foun- 
dation to these observations. More importantly, we could also 
relate these findings in evolutionary biology to predictions of 
their molecular consequences. In particular, we predict that if 
evolution is indeed under equilibrium, the fraction of stabi- 
lizing and destabilizing mutations are almost equal among 
single-nucleotide polymorphisms (SNPs) segregating at high 
allele frequencies. Despite some simplifying assumptions, this 
prediction is proven true for polymorphisms in the human 
coding region. 

Results 

Coupling Biophysics and Population Dynamics in the 
Polyclonal Regime 

To couple population dynamics and molecular biophysics, we 
model an evolving population of Ne=10^ organisms with 
explicit genomic sequences consisting of ten open reading 



frames that code enzymes from the folate biosynthetic path- 
way (fig. 1A and supplementary table SI, Supplementary 
Material online), an essential biochemical pathway for 
amino acid synthesis. These model genes have corresponding 
3D structures from the protein databank that can be used in 
estimating the biophysical effects of mutations (see Materials 
and Methods). We assume that the fitness/ of the organism 
or its probability of replication is a function of both the total 
metabolic output (Dykhuizen et al. 1987) and the total 
number of misfolded proteins in the cell; the latter accounts 
for the cytotoxicity of misfolded proteins (Drummond and 
Wilke 2008). Thus, the total fitness is 

/ — /flux "/toxicity (1) 

From linear pathway theory, the flux term may be ex- 
pressed as (see Materials and Methods) 



10 _^ 

where S; is the enzyme efficiency. A, is the cellular abundance, 
and ACj is the folding free energy. The index / is for each gene 
in the model. We make the simplifying assumption that all 
enzymes have the same efficiency = 1. The factor ^ = 1//cbT 
(kBT= 0.593 kcal/mol) and Qq is a normalizing constant (see 
Materials and Methods). The contribution to fitness of the 
misfolding toxicity may be expressed as (Serohijos et al. 2012) 




where c=^0~'^ is the fitness cost per misfolded protein 
(Drummond and Wilke 2008). In this formulation, the opti- 
mal fitness is 1 and occurs in the regime where proteins are 
very stable (supplementary fig. SI, Supplementary Material 
online). Equations (1)-(3) constitute an explicit biophysics- 
based genotype-phenotype relationship. The model also fea- 
tures epistasis between genes because they are all coupled in 
the fitness function (eqs. 1-3). 

We coupled the genotype-phenotype relationship to an 
evolutionary dynamics model that includes mutation, 
selection, and drift (see Materials and Methods and supple- 
mentary fig. S2, Supplementary Material online). Specifically, 
at each replication event, a cell divides into two daughter cells, 
each can potentially mutate at the rate of /x = 0.01 /genome/ 
replication (1.5 x 10~^/base pair/replication). If the mutation 
is nonsynonymous, we estimate the change in folding 
stability (AAC = ACmutant- ACwiidtype) using a physical 
force field (Yin et al. 2007) and then update the fitness of 
the organism (see Materials and Methods). We ran our 
simulation until it achieved mutation-selection balance 
(fig. IB). Throughout the simulation run, we saved the full 
history of all arising mutations and the genomes of all 
surviving individuals (see Materials and Methods). Analysis 
was performed only in the regime of mutation-selection 
balance. 
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Fig. 1. Model of protein evolution that couples biophysics and population dynamics in the polyclonal regime. (A) A model of organism composed of 
ten genes from the folate biosynthetic pathway (supplementary table S1, Supplementary Material online). Cellular fitness / is proportional to the 
effective metabolic output of this pathway and the total number misfolded proteins (eqs. 1-3). The population is subject to mutation, drift, and 
purifying selection. Mutations can change the folding stability AG of a gene and hence the fitness of the cell (see Materials and Methods). Effective 
population size is - 10^. (B) Fitness under mutation-selection balance. (C) Detailed trajectory of the folding stability of each gene in individuals in the 
population. Individual cells are indexed along the y axis, where spatial proximity is proportional to kinship. 



Mutation-Selection Balance Under Pervasive Clonal 
Interference 

We then analyzed the various types of mutations and the 
dynamics of their segregation in the population. Because the 
simulations are performed at high mutation rate, there is 
pervasive clonal interference (fig. 1C). We classified all arising 
mutations according to their fitness effect, quantified by the 
selection coefficient s = (f^utant -/wiidtype)//wiidtype. Because 
the fitness function is protein-centric, mutations that increase 
folding stability (A AC < 0) are beneficial, whereas those that 
decrease stability (AAC>0) are deleterious. Synonymous 
substitutions are considered neutral. In figure 2A, we show 
typical trajectories of mutations that eventually fixed in the 
population (see also supplementary fig. S3, Supplementary 
Material online). In the regime of high mutation rate, several 
mutations arise over the lifetime of a segregating allele (fig. 1C 
and 2A). The distribution of minor allele frequencies for the 
SNPs in simulation and the human coding region are shown 
in supplementary figure S4, Supplementary Material online. 

Our simulations exhibit clonal sweeps, characterized by the 
correlated fixation of mutations, usually driven by a beneficial 



mutation (fig. IE). Such clonal sweeps are typically character- 
ized by a slow rise followed by a rapid drop in polymorphisms 
(fig. 2C). The anatomy of such sweeps entails deleterious mu- 
tations hitchhiking on the beneficial mutations; consequently, 
these deleterious mutations now have a significant probabil- 
ity of fixation compared with the monoclonal regime (fig. 3). 
Beneficial mutations, however, do not fix as likely as in the 
monoclonal regime because they now arise in the context of 
many deleterious mutations (fig. 4A, C, and £; supplementary 
fig. S3, Supplementary Material online). The deleterious hitch- 
hikers effectively dampen a beneficial mutation's overall fit- 
ness effect, thus lowering its probability of fixation (fig. 3). The 
extent of hitchhiking by destabilizing mutations on stabilizing 
ones can be estimated from the distribution of A AC for all 
possible arising nonsynonymous mutations available to a 
wildtype sequence. This distribution appears to be universal 
across types of protein folds (Tokuriki et al. 2007). 
Approximately 20% of nonsynonymous mutations are stabi- 
lizing (A AC < 0), whereas the rest are destabilizing (Tokuriki 
et al. 2007; Zeldovich et al. 2007); see also the blue curve in 
(fig. 6B). The extent of hitchhiking and background selection 
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Fig. 2. Mutation-selection balance under pervasive clonal interference. (A) Sample history of mutations that reached fixation. Time interval corre- 
sponds to figure 1C. Arrows indicate the correlated fixation of mutations. (Only three genes are shown; see supplementary fig. S3, Supplementary 
Material online, for the complete trajectory.) (B) Selection coefficients of mutations that successfully fixed. (C) Extent of polymorphism in the evolving 
population. Drop in diversity accompany clonal sweeps. (D) Distribution of SNPs in the simulation. All SNPs (blue); SNPs with allele frequencies >1% 
(black). 
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Fig. 3. Probability of an arising mutation to reach an allele frequency X. 
A = 1 corresponds to fixation. Red line is the probability of fixation in the 
monoclonal regime. Fixation probability of a neutral mutation in the 
monoclonal regime (1/2Ne) is indicated. Interference among clones 
takes two specific forms: background selection and hitchhiking. 

is generally a function of mutation rate and population size; 
nonetheless, from purely biophysical considerations, there are 
potentially four destabilizing mutations that could hitchhike 
for every stabilizing mutation. 



We also show the probability that a mutation reaches an 
allele frequency k (fig. 3). Interestingly, alleles segregating at 
~50% are almost determined to fix (fig. 3). We note that the 
pervasive clonal interference in our simulation occurs under 
mutation-selection balance and is distinct from the more 
common treatment of clonal interference in literature, 
which is only among beneficial mutations and specifically in 
the context of adaptation (Gerrish and Lenski 1998; Fogle 
et al. 2008). 

Fitness and Molecular Effects of Mutations Under 
Mutation-Selection Balance 

As noted earlier, one of the primary utilities of the patterns of 
polymorphism in genomes is quantitatively estimating the 
DFE. Thus, we next explore the resulting DFE from our sim- 
ulation and compare the distribution with estimates from 
Bayesian approaches. 

In the genotype-phenotype relationship defined by equa- 
tions (1)-(3), epistatic interactions on folding stability play a 
crucial role in determining the fitness effects of mutations 
(fig. 4). Specifically, in our model, the fitness effect of a mu- 
tation with a molecular effect A AC depends on the folding 
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Stability of the current wildtype or ACpremutation (fig- ^\ The 
same AAG can have very near neutral effect if it occurs 
in proteins that are stable but can have sizable effects if 
it occurs in proteins that are unstable. For example, a desta- 
bilizing mutation of AAG = 1 kcal/mol occurring in genes 
with AGpremutation = "8 kcal/mol has a fitness effect of 
Ns ^ —10""^; however, the same mutation occurring in 
genes with AGpremutation = " 0-5 is lethal. A stabilizing 
mutation of AAG = —1 kcal/mol occurring in genes with 
AGpremutation = "8 kcal/mol has a fitness effect of 
Ns ^ + 10""*; however, if it occurs in genes with 
AGpremutation = kcal/mol, the mutation is extremely 
beneficial Ns ^ + 10^. Thus, in the regime where proteins 
are stable, both destabilizing and stabilizing mutations have 
N I s I < < 1 ; however, because of the larger supply of desta- 
bilizing than stabilizing mutations, most mutations that fix 
are destabilizing. This imbalance gives rise to a mutational 
drift of AG toward less stable proteins and away from the 
flatter part of the fitness landscape (fig. 4/and supplementary 



fig. 52, Supplementary Material online). In the regime where 
proteins are less stable, selection for stabilizing and selection 
against destabilizing mutations lead to fixation of a larger 
fraction of stabilizing mutations (fig. 4/ and supplementary 
fig. S2, Supplementary Material online). This dominance of 
selection drives AG toward more stable proteins and away 
from the less fit part of the fitness landscape (supplementary 
fig. S2, Supplementary Material online). The balance between 
selection and drift occurs at some intermediate folding sta- 
bility (fig. 4/, ~4 kcal/mol), where stabilizing and destabilizing 
mutations have equal likelihood of being fixed. 

Because mutation-selection balance is a dynamic equilib- 
rium, the protein finds itself on the left or right hand side of 
~4 kcal/mol, but on average, it resides in this neighborhood, 
giving rise to the observation that proteins are "marginally 
stable" (Taverna and Goldstein 2002; Bloom et al. 2007; 
Zeldovich et al. 2007). The balance between drift and selec- 
tion defines the mode of the equilibrium distribution of fold- 
ing stabilities (fig. 4£). This equilibrium distribution is in 
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agreement with the empirically measured AC distribution in 
ProTherm (Kumar et al. 2006), as pointed out in earlier works 
(Zeldovich et al. 2007; Wylie and Shakhnovich 2011). 
Additionally, for protein coding regions, the strictly neutral 
regime (s = 0) is not a stable attractor in protein evolution 
because mutational drift due to nonsynonymous substitu- 
tions always drives proteins toward marginal stability (figs. 
4F and 5) (Taverna and Goldstein 2002). 

Because the folding stability AC evolves as consequence of 
mutation-selection balance, so should the fitness effect s, 
which itself is a function of folding stability. Thus, the DFE 
is expectedly a dynamic consequence of the resulting popu- 
lation dynamics. Shown in figure 5B and C are the resulting 
DFE of arising random mutations and fixed nonsynonymous 
substitutions. The DFE of fixed deleterious mutations is 
bounded on one side by drift away from the neutral regime 
(i.e., drift from stable AC) and on another side by selection 
(fig. 4C). The DFE of fixed beneficial mutations is more 
nuanced. It is bounded on one side by drift and limited on 
the other side by the supply of stabilizing mutations. These 
stabilizing and beneficial mutations have only effectively 
near neutral effect in the background of the folding stability 
values under mutation-selection balance (fig. 4B). In short, 
the observation of marginal folding stabilities of proteins is 
coupled to the effective near neutrality of the fitness effects of 
fixed amino acid substitutions. 

Additionally, under mutation-selection balance, the mag- 
nitude of the selection coefficient is of near neutral effect, 
whereas the magnitude of the folding stability changes is far 
from neutral (fig. 5A). The nonneutrality of the molecular 
effect (A AC) and the near neutrality of fitness effect s are 
due to the background AC, which evolves to an equilibrium 
distribution that ensures the near neutrality of the fixed del- 
eterious and beneficial mutations (fig. 5A). The effective near 
neutral theory, originally a postulate (Ohta 1973), finds a solid 
and mechanistic foundation in protein biophysics and the 
selection against protein misfolding and selection for function 
due to metabolic flux. 

We note that the resulting DFE from simulations is skewed 
and can be fitted to a gamma distribution (fig. 5C and D), in 
agreement with studies that estimated the DFE using maxi- 
mum likelihood methods on human and in flies (Bustamante 
et al. 2005; Eyre-Walker et al. 2006; Kryukov et al. 2009). 
Similar works have also shown that most mutations 
could be of near neutral effect (Bustamante et al. 2005; 
Eyre-Walker et al. 2006; Sawyer et al. 2007; Kryukov et al. 
2009); however, the molecular basis was unclear. Our work 
provides a molecular and mechanistic origin of these obser- 
vations based on the emerging genotype-phenotype rela- 
tionships (eqns. 1-3). 

Mutation-selection balance is a dynamic equilibrium; 
thus, there should be equal numbers of fixed beneficial and 
deleterious mutations, a result hypothesized as early as 1930 
by Fisher (1930) and articulated recently in the monoclonal 
regime by some groups (Sella and Hirsh 2005; Mustonen and 
Lassig 2009). Our own simulations in the monoclonal regime 
with a biophysics-based genotype-phenotype relationship 
(Serohijos et al. 2012) also confirm this hypothesis 



(supplementary figs. S7 and S8, Supplementary Material 
online). We show that despite the more complicated dynam- 
ics in the polyclonal regime, this inference is robust as man- 
ifested by the bimodal distribution of the selection 
coefficients of fixed mutations (fig. 5C). In strictly monoclonal 
populations, the boundary of near neutrality is at N | s | ~ 1 
(Sella and Hirsh 2005; Goldstein 2011; see also supplementary 
figs. S7 and S8, Supplementary Material online). However, in 
the polyclonal regime, because of extensive hitchhiking and 
background selection that effectively lead to a flatter proba- 
bility of fixation (fig. 3), the bounds of effective near neutrality 
extend beyond N | s | ~ 1 (fig. 5A, C). 

Stability Effects of Nonsynonymous SNPs Segregating 
at Various Allelic Frequencies 

We have shown that the inferred DFE (Bustamante et al. 2005; 
Eyre- Walker et al. 2006; Kryukov et al. 2009) can be explained 
under mutation-selection balance. Because this result from 
simulations seem to contradict the large body of literature 
arguing for a predominantly adaptive (hence, out of equilib- 
rium) tempo of protein evolution (McDonald and Kreitman 
1991; Smith and Eyre- Walker 2002), we then sought to estab- 
lish another empirical support for our analysis. 

We know the full history of all mutations, thus we can 
relate the stability effects (A AC) of SNPs to their allele fre- 
quencies (fig. 6A). Most arising mutations are destabilizing, 
and those SNPs segregating at low frequencies are still pre- 
dominantly destabilizing (fig. 6A). This high fraction of desta- 
bilizing mutations among low-frequency SNPs is directly 
supported by explicit biophysical measurements of the sta- 
bility effects of SNPs from a diverse set of 16 human enzymes 
(Allali-Hassani et al. 2009) and by bioinformatics analysis (Yue 
and Moult 2006). This result is also in agreement with the 
observation that disease-associated SNPs, because of their 
very deleterious effects, segregate at lower frequencies than 
regular polymorphisms (De Baets et al. 2012). On the other 
hand, among SNPs segregating at higher allele frequencies, the 
fraction of destabilizing SNPs decreases because of purifying 
selection. In particular, for SNPs segregating at 40% allele fre- 
quency, close to the probability of fixation (fig. 3), the fraction 
of stabilizing and destabilizing SNPs are almost equal (fig. 6A). 
The estimates of the folding stability effects of SNPs in the 
human coding region (fig. 6B) indeed show the increasing 
manifestation of purifying selection among SNPs of higher 
allele frequencies. Most importantly, arguing for mutation- 
selection balance (at least for protein evolution), SNPs that 
are close to fixation approach the limit of equal fraction of 
stabilizing and destabilizing A AC (fig. 6B). 

Discussion 

By developing an evolutionary model based on molecular 
biophysics and on an intuitive genotype-phenotype relation- 
ship, we provide a more mechanistic and molecular under- 
standing on how polymorphisms could arise and segregate in 
the coding region of genomes. Several works have tried to 
bridge molecular biophysics and population genetics both in 
coding (DePristo et al. 2005; Bloom et al. 2007; Drummond 
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and Wilke 2008; Goldstein 2011; Serohijos et al. 2012) and 
noncoding (Mustonen et al. 2008) regions of genomes. 
However, these studies are in the monoclonal regime and 
do not explore the relevance of biophysics to polymorphisms 
as we do here. 

The DFE is not a constant in evolution but an evolvable 
property and a consequence of the evolutionary dynamics. In 



this work, we showed how, in the context of coding region 
evolution, the protein folding stability evolves to ensure the 
near neutrality of the fitness effects of stabilizing mutations, 
reaching a dynamic steady state defined by the mutation- 
selection balance. Specifically, the DFE evolves to be centered 
around effective near neutrality with a characteristic skewed 
gamma distribution. The boundaries of effective near 
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neutrality are strongly determined by the population dynann- 
ics; in this case, pervasive clonal interference leads to weak- 
ened selection that expands the regime of near neutrality. 
This effective weakening of selection due to hitchhiking has 
been described previously (e.g., McVean and Charlesworth 
2000; Neher and Shraiman 2011). However, here, we provide 
a mechanistic and direct connection on how it can arise in 
the context of protein evolution. The extent of the weakening 
of selection is expectedly a function of population size and 
mutation rate (Neher and Shraiman 2011; Wylie and 
Shakhnovich 2011) — in this work, we chose the population 
size of 10^ organisms, which we tried to be as close the human 
effective population size (10^^) (Lynch and Conery 2003) but 
still computationally tractable. 

Under this dynamic equilibrium of mutation-selection 
balance, the near neutral theory (Ohta 1973) is not a postu- 
late but a robust consequence of the interplay between bio- 
physics and evolutionary dynamics. The standard molecular 
argument for the claimed neutrality of most mutations is that 
a significant fraction of residues in a protein (~85%) do not 
participate in the active site thus unrelated to function. 
However, this is inconsistent with molecular biophysics 
where mutations are never neutral, as they always affect fold- 
ing stability (Tokuriki et al. 2007; Zeldovich et al. 2007) and 
other molecular properties of proteins such as their interac- 
tions with other proteins in cytoplasm (Vavouri et al. 2009; 
Heo et al. 2011). Here, we have shown the despite the 
non neutral effects of mutations at the level of macromole- 
cules, the population evolve to ensure the near neutrality of 
their fitness effects (fig. 5). 

We also note the major distinctions between our work and 
the theoretical models that advocate selection for mutational 
robustness (van Nimwegen et al. 1999; Wilke et al. 2001). First, 
these neutral network models assume that mutations are 
either neutral or lethal. The relative fraction of neutral to 
lethal neighbors defines the degree of mutational robustness. 
In our model, no a priori assumptions are made on the DFE. 
As argued above, the same AAC mutations could have a 
fitness effect ofbeN|s| >>1orN|s| <<1 depending 

on background ACpremutation- 

Second, these theoretical and computational models also 
assume that there are multiple peaks in the fitness land- 
scape — the flattest, most mutationally robust peak is distinct 
from the highest peak, which could be less robust. In our, 
genotype-phenotype model based on the thermodynamics 
of protein folding stability, the regime that is most robust is 
also the regime that is most fit (supplementary fig. S2, 
Supplementary Material online), and this is the regime of 
high folding stability. We note, however, that under our 
model of mutation-selection balance, proteins evolve toward 
marginal stability, hence organisms are not optimally fit (sup- 
plementary fig. S2, Supplementary Material online). 

Third, in the models arguing for mutational robustness, 
under low mutation rate, the population evolves to higher 
peaks even if less robust. Under high mutation rate, the pop- 
ulation evolves to the flatter peaks, because selection for 
mutational robustness outweighs selection for fitness, hence 
the "survival of the flattest." In both cases of high and low 



mutation rates, evolution is always a process of optimiza- 
tion — high robustness at high mutation rate or high fitness 
under low mutation rate. In our model, however, the evolu- 
tion is always toward mutation-selection balance, where pro- 
teins are marginally stable (fig. 4), the fitness effects are near 
neutral (fig. 5), and the organisms are not optimally fit (sup- 
plementary fig. S2, Supplementary Material online). This evo- 
lution toward marginal folding stability and suboptimal 
fitness holds under low mutation rate, where the bounds of 
near neutrality is N | s | ~ 1 (supplementary figs. S7 and S8), 
and under high mutation rate, where the bounds of near 
neutrality is greater than N | s | ~ 1 because of hitchhiking 
and background selection (fig. 5). Of course, when the muta- 
tion rate is very high or when the population size is too small, 
the condition for mutation-selection balance may not be 
satisfied leading to extinction (Zeldovich et al. 2007; Wylie 
and Shakhnovich 2011). In short, in our model, evolution is 
not necessarily a process of optimization. 

Fourth, in the neutral network models of sequence 
evolution, the most neutral part of the landscape represents 
a stable attractor. In our model, however, the flattest part 
of the landscape is not an attractor because of mutational 
drift. Thus, evolution proceeds "toward near neutrality" 
is the correct description rather than simply "toward 
neutrality." 

Altogether, the terms neutral and near neutral is rather 
unfortunate, because they suggest that the latter is simply an 
update or a correction to the neutral theory. However, as 
noted above, there are fundamental mechanistic and concep- 
tual differences between the two, and it is the near neutral 
theory that is most consistent with protein biophysics. 

We have also shown that the patterns of polymorphisms, 
when framed in very direct observables such as changes in 
folding stability, in fact, support the argument for a predom- 
inantly nonadaptive tempo of evolution (Bustamante et al. 
2005), contrary to prior claims resulting from the so-called 
tests of neutrality (McDonald and Kreitman 1991; Smith and 
Eyre-Walker 2002). In the future, to further reconcile the 
adaptive view of evolution and the effective near neutrality 
(as argued here), an extensive analysis of polymorphisms 
generated from our approach and the McDonald-Kreitman 
tests must follow. Additionally, biophysical analysis of SNPs 
in model organisms across all kingdoms of life will systemat- 
ically test the universality of the results demonstrated in 
figure 6B. 

We explicitly discussed the mechanism for how the DFE of 
coding region mutations becomes centered around effective 
near neutrality under mutation-selection balance. This result 
may be extended to the noncoding region because the emer- 
gent near neutrality under mutation-selection balance is 
robust to the details of the fitness function, as long as the 
genotype-phenotype relationship features a convex 
curved functional form reflecting the diminishing fitness im- 
provement upon further optimization of molecular proper- 
ties (Akashi et al. 2012). In the noncoding region, a "curved" 
genotype-phenotype relationship arises from the ability 
of replication-related proteins (such as polymerases, tran- 
scription factors) to bind to DNA or RNA (Mustonen 
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and Lassig 2009). In this respect, the thermodynamics of 
protein-DNA interaction is analogous to the thermodynam- 
ics of protein folding. 

The molecular view of mutation-selection balance 
described here also clarifies a meaningful distinction (previ- 
ously pointed out in Mustonen and Lassig [2009]) between 
positive selection (existence of beneficial mutations that 
eventually outcompetes wildtype) and true adaptation 
(moving selection target). Indeed, here, there is an ample 
supply of beneficial mutations originating from constantly 
arising stabilizing mutations; however, these beneficial 
mutations are not truly adaptive but only maintain 
mutation-selection balance. That is, the observation among 
biophysicists that proteins are marginally stable (Privalov 
1979; Taverna and Goldstein 2002; Kumar et al. 2006; 
Bloom et al. 2007; Zeldovich et al. 2007) and the observation 
in evolutionary biology that coding region evolution is pre- 
dominantly nonadaptive (e.g., in human; Bustamante et al. 
2005) are the molecular and phenotypic manifestation of the 
balance between drift and selection for folding stability. 

The nature and shape of the DFE should depend on the 
mutation rate, population size, and number of genes in 
the organism. However, the natural expectation is that the 
higher population sizes and higher mutation rates increase 
the extent of clonal interference and thus could in principle 
further expand the bounds of near neutrality. The systematic 
effects of mutation rate, population, and the number of 
genes on polymorphisms in the context of this biophysics- 
based population dynamics model will be explored in future 
studies. 

The approach we present here can only be improved as we 
become more quantitative and systematic in our understand- 
ing of the genotype-phenotype relationship and integrate it 
into a comprehensive cellular model (Karr et al. 2012). The 
explicit genotype-phenotype relationship could be the start- 
ing point for investigating the evolutionary consequences of 
the cellular quality control machinery (chaperones and pro- 
teases) that can modulate the fitness effects of mutations and 
hence the expected patterns of polymorphisms. Additionally, 
a realistic cellular model representing more complete prote- 
omic and metabolic network information could explore 
the relationship between the DFE per gene and the DFE 
on the whole organism (Soskine and Tawfik 2010). 

We note that this approach of coupling molecular bio- 
physics and population genetics has already been crucial in 
explaining other emerging genomic patterns, such as why 
highly abundant proteins consistently evolve slowly 
(Pal et al. 2001; Drummond and Wilke 2008; Yang et al. 
2010; Serohijos et al. 2012) or tend to be more stable 
(Drummond and Wilke 2008; Cherry 2010; Serohijos et al. 
2012, 2013). Because the representation of the evolving pop- 
ulation is explicit, our approach could also provide a frame- 
work to account for the role of changing environments. 
We also believe that this approach could provide an explicit, 
mechanistic null model for statistically inferring muta- 
tions that are truly functional and adaptive (Kumar et al. 
2012). 



Materials and Methods 

Fitness Function 

To begin with the most basic model of evolution that has 
some semblance of realism in accounting for the biophysical 
properties of proteins and the genotype-phenotype relation- 
ship, we choose to model an organism (supplementary 
table SI, Supplementary Material online) based on a core 
metabolic pathway and postulate that its fitness is propor- 
tional to the metabolic flux (Milo and Last 2012). Assuming 
that all the enzymes follow a linear metabolic pathway, the 
fitness due to flux/flux = J]/'=i c' "^^^^^ ^ number of 
input metabolites, Sj is the enzymatic efficiency, and C, is the 
number of functional copies. The functional copies 
correspond to number of enzymes in the folded (native) 
state C/ = A/Pnat> where Ppat,/ is the Boltzmann probability 
of the protein / to be in the native state and A, is the total 
concentration of protein / in the cell. Assuming a two- 
state folding thermodynamics P^^^ = e'^^^' /(^ + e'^^^') 
(Privalov 1979). 

Another emerging constraint in protein evolution is the 
global selection against the cytotoxicity of protein misfolding 
(Drummond and Wilke 2008; Serohijos et al. 2013). Formally, 
the fitness due to toxicity is /toxicity = cJ^Z^ ^' ('" " ^nat,,). 
where c = ^0~^, the fitness cost per misfolded protein 
(Drummond and Wilke 2008; Geiler-Samerotte et al. 2011). 
Altogether, the total fitness is described by equation (1). 
Without loss of generality, we require that fitness is 
optimally 1 at very stable regimes, /(AC — oo) = 1, 

leading to a = 1/E/=i (^')"'- When /toxicity > /flux, the fit- 
ness is defined to be /= 0, hence / > 0. The resulting 
fitness defined by equations (1)-(3) is essentially parame- 
ter-free. 

In our earlier work (Serohijos et al. 2012), the goal was to 
determine whether selection against the cytotoxity of protein 
misfolding is sufficient to explain the widely observed abun- 
dance-evolutionary rate correlation (Drummond and Wilke 
2008). Thus, to make an explicit comparison and connection 
with earlier literature, we only focused on the selection against 
protein misfolding. In this study, we generalize the fitness 
function to include the notion of selection for more func- 
tional copies, motivated by numerous works that map met- 
abolic output to fitness (Edwards et al. 2001; Duarte et al. 
2007), which depends on the biophysical properties of pro- 
teins (Bar-Even et al. 2010). 

We note that in our model, there is epistasis between 
genes because they are coupled in the nonlinear fitness 
function (eqs. 1-3). In the fitness effect s = (/mutant/ 
/premutation) " 1. the value of /premutation IS determined by the 
biophysical properties of all genes in the cell. Thus, the 
quantitative effect of a prior mutation in one gene could 
influence the fitness effect of the current mutation in 
another gene. The epistasis is strongest when mutations fall 
on the genes with low folding stability, because this is where 
the curvature of the fitness landscape is most pro- 
nounced (supplementary fig. S2, Supplementary Material 
online). 
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Evolutionary Dynamics with Mutation, Selection, 
and Drift 

We follow the standard Moran process in the evolutionary 
simulations (supplementary fig. SI, Supplementary Material 
online). The fitness function (eq. 1) defines the replication 
rate. We make no prior assumption on the dynamics and/or 
the resulting DFE. To update the evolution of the organism, 
we use the Gillespie algorithm: At each replication event, the 
replicating cell splits into two daughter cells, each can poten- 
tially mutate at the rate of /x = 0.01 /genome/replication. If the 
mutation is nonsynonymous, the folding stability of the 
mutated gene is updated with A AC values estimated using 
a physical force field and the 3D structures as input (discussed 
later). We performed this simulation of mutation, selection, 
and drift toward the dynamic equilibrium imposed by muta- 
tion-selection balance (fig. IB). 

All the genes are initialized with folding stability 
ACo= — 5 kcal/mol at time t = 0. This dynamic equilibrium 
of mutation-selection balance is robust to the choice of 
starting ACq. When the proteins are initialized at the very 
stable regime, these genes will migrate toward less stable 
regime because of drift. On the other hand, when the proteins 
are initialized in the less stable regime, the genes migrate 
toward greater folding stability because of selection. All anal- 
ysis reported are performed only after the population has 
reached mutation-selection balance (fig. 1). 

After the simulation reaches mutation-selection balance, 
we save the information of all arising mutations, including 
their location in the genome, nucleotide change, A AC, and s. 
We also save the genomes of all cells in the population every 
10 generations (i.e., every lONe replication events). The infor- 
mation allows us to reconstruct the trajectories and all arising 
mutations. 

In our earlier work (Serohijos et al. 2012), where N/x ^ 1 
and the population is monoclonal, we employed 
Wright- Fisher's discrete nonoverlapping generations model. 
Also, in the earlier work, we defined the fitness 
2iS w = exp[— c(total count of misfolded proteins)] follow- 
ing Drummond and Wilke (2008). In the present work, 
where we wish to determine the dynamics of segregation 
and clonal interference between alleles, we perform our 
simulations in the overlapping generations model where 
the fitness (eq 1) is of the Malthusian type. Both fitness def- 
initions satisfy the transformation //via/th™ = ^^ifwhght- fisher) 
(Orr 2009). 

Updating the Folding Stability During 
Nonsynonymous Substitutions 

When a nonsynonymous substitution occurs, we update the 
folding stability according to AC = ACq + AAC,, 
where ACq is the stability of the protein at time t = 0, 
AAC, is the estimated change in folding stability due to a 
single point mutation, and n is the total number of amino 
acid differences of the current sequence with respect to the 
sequence at time = 0. AAC is estimated using a physical force 
field (Yin et al. 2007). Our protocol assumes that the effect on 
folding stability of multiple mutations is simply the additive 



effect of all mutations acting independently (Fersht et al. 
1992). 

This assumption cannot accurately predict the AC of a 
sequence that is significantly diverged from the sequence at 
time = 0. Indeed, we do not claim that when a specific 
sequence from simulation is experimentally expressed and 
purified, the measured AC is similar to the one predicted 
in simulation. Ideally, one could calculate the AAC using the 
3D structure as input as soon as these mutations arise in the 
population during an evolutionary run. However, this imple- 
mentation is currently computationally intractable because 
estimation of the AAC of one mutation takes ~5min per 
mutation, and the evolutionary simulation evaluates 10^ 
mutations. The assumption of linearity, however, does not 
compromise biophysical correctness because the model cap- 
tures the essential features of protein evolution. 

First, these force fields coupled with selection for folding 
stability can recapitulate the extent of per site amino acid con- 
servation among naturally occurring homologous sequences 
(Ding and Dokholyan, 2006). Indeed, this is one important 
test during the development of these biophysical tools. 

Second, for any "wildtype" sequence in our simulation, 
the distribution of A AG for all arising single amino acid 
mutations is consistent with the A AG distribution for 
random mutations in naturally occuring sequences 
(Tokuriki et al. 2007). This feature is important because it 
quantitatively determines the balance between the supply 
of destabilizing and stabilizing mutations and the strength 
of the mutational drift. 

Analysis of Simulation Trajectories 
After simulation, we trace the history of each arising mutation 
by counting the number of individuals that exhibit the muta- 
tion in the future generations. The tracing of mutational 
history ends when the mutation fixes in the population or 
is lost to random drift (fig. 2 and supplementary fig. S3, 
Supplementary Material online). From these trajectories, we 
calculate the correlated fixation events (fig. IB). To estimate 
the probability of fixation (fig. 3), we group the trajectories 
according to their allele frequencies and selection coefficients. 
The estimated probability is the number of trajectories that 
reached a given allele frequency X divided by all arising tra- 
jectories of that given selection coefficient. 

We use Matlab to fit a gamma distribution to the DFE. 
Also, we follow the standard procedure in evolutionary biol- 
ogy to model the demography of an asexual organism in 
estimating the DFE among sexual organisms such as human 
and flies (Bustamante et al. 2005; Eyre-Walker et al. 2006; 
Kryukov et al. 2009). 

Bioinformatics Analysis 

For bioinformatics analysis of the SNPs in the human genome, 
we use the database dBSNP (Sherry et al. 2001) for allele 
frequency information and SNPEffect (De Baets et al. 2012) 
for biophysical estimation of the AAC. SNPEffect used the 
3D structure of the protein as input to FoldX (Schymkowitz 
et al. 2005). We group the SNPs according to their allele 
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frequencies and plotted the distribution of their A AC 
(fig. 6B). For arising A AC (fig 6B, blue line), we use the em- 
pirically measured A AC from a diverse set of proteins in the 
ProTherm database (Kumar et al. 2006; Tokuriki et al. 2007). 

Simulations in Monoclonal Populations as Control 
To serve as a control and show that the results in this study 
are robust to the modeling of the cell or to the formal equa- 
tion of the fitness function, we perform simulations of an 
evolving monoclonal population. The methods are described 
in greater detail in a recent work (Serohijos et al. 2012). Briefly, 
in the monoclonal simulations, the cell is composed of 10^ 
genes, each with protein abundances ranging from 10 to 10^ 
copies per cell to recapitulate the broad distribution of abun- 
dances in real organisms (Ghaemmaghami et al. 2003; 
Ishihama et al. 2008). The effective population size in the 
monoclonal simulation is Ne = ^0^. Effects of folding stability 
are estimated from the Gaussian distribution of A AC whose 
parameters are derived from the ProTherm database (Kumar 
et al. 2006). The evolutionary dynamics likewise include mu- 
tation, selection, and drift (see Serohijos et al. [2012] for de- 
tails). This specific approach, which recapitulates the universal 
observation of highly expressed proteins evolving slowly 
(Drummond and Wilke 2008; Serohijos et al. 2012) or that 
highly expressed proteins tend to be more stable (Cherry 
2010; Serohijos et al. 2013), also exhibits the near neutrality 
of the fixed beneficial mutations (supplementary figs. S7 and 
S8, Supplementary Material online) and the equal partitioning 
of fixed mutations into A AC > 0 and A AC < 0. 

Supplementat7 Material 

Supplementary figures S1-S8 and table SI are available at 
Molecular Biology and Evolution online (http://www.mbe. 
oxfordjournals.org/). 
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